Libraries
library(tibble)
library(tidyverse)
library(tidyr)
library(readr)
library(dplyr)
library(ggplot2)
library(plotly)
library(kableExtra)
Read in the gapminder_clean.csv data as a tibble using read_csv.
gmclean <- read_csv("gapminder_clean.csv")
Filter the data to include only rows where Year is 1962 and then make a scatter plot comparing ‘CO2 emissions (metric tons per capita)’ and gdpPercap for the filtered data.
gmclean1962 <- gmclean %>%
filter(Year == 1962, continent %in% c("Africa", "Americas", "Asia", "Europe", "Oceania"))
co2gdp <- ggplot(gmclean1962, aes(x = `CO2 emissions (metric tons per capita)`, y = gdpPercap, color = continent)) +
geom_point() +
scale_x_log10() +
scale_y_log10() +
ggtitle("gdpPercap vs CO2 Emissions")
ggplotly(co2gdp)
On the filtered data, calculate the correlation of ‘CO2 emissions (metric tons per capita)’ and gdpPercap. What is the correlation and associated p value?
The code show below provides the correlation and associate p value. Due to the linear shape of the graph above, the pearson correlation was used to calculate the linear relationship between these two continuous variables in both question 3a and 4a.
cor.result <- cor.test(formula = ~ gmclean1962$`CO2 emissions (metric tons per capita)` + gmclean1962$gdpPercap, data = gmclean1962)
The correlation between the two variables is 0.9260817, which implies a positive linear correlation. With a significance level of α = 0.05 and an associated p value 1.1286792^{-46}, we have sufficient evidence to claim that the correlation is statistically significant.
On the unfiltered data, answer “In what year is the correlation between ‘CO2 emissions (metric tons per capita)’ and gdpPercap the strongest?” Filter the dataset to that year for the next step. #should only get 10 values for each year
gmstrong <- gmclean %>%
group_by(Year) %>%
summarise(crltn = max(cor(`CO2 emissions (metric tons per capita)`, gdpPercap, use = "complete.obs"))) %>%
arrange(desc(crltn))
kbl(head(gmstrong)) %>%
kable_styling()
| Year | crltn |
|---|---|
| 1967 | 0.9387918 |
| 1962 | 0.9260817 |
| 1972 | 0.8428986 |
| 1982 | 0.8166384 |
| 1987 | 0.8095531 |
| 1992 | 0.8094316 |
As shown above, the year with the strongest correlation between ‘CO2 emissions (metric tons per capita)’ and gdpPercap is 1967.
Using plotly, create an interactive scatter plot comparing ‘CO2 emissions (metric tons per capita)’ and gdpPercap, where the point size is determined by pop (population) and the color is determined by the continent. You can easily convert any ggplot plot to a plotly plot using the ggplotly() command.
# the data filtered to 1967
gmclean1967 <- gmclean %>%
filter(Year == 1967, continent %in% c("Africa", "Americas", "Asia", "Europe", "Oceania"))
co2gdp67 <- ggplot(gmclean1967, aes(x = `CO2 emissions (metric tons per capita)`, y = gdpPercap, color = continent)) +
geom_point(aes(size = pop, alpha = 0.5)) +
scale_size(guide = "none") +
scale_alpha(guide = "none") +
scale_x_log10() +
scale_y_log10() +
ggtitle("gdpPercap vs CO2 Emissions")
ggplotly(co2gdp67)
What is the relationship between continent and ‘Energy use (kg of oil equivalent per capita)’? (stats test needed)
The relationship each continent has with its own ‘Energy use (kg of oil equivalent per capita)’ differs significantly from how much energy is used in other continents. As there are three or more independent categorical variables in the continent column of the dataset, I used the Kruskal-Wallis test. The level of significance is set to α = 0.05.
gmcleane <- gmclean %>%
rename(energy = "Energy use (kg of oil equivalent per capita)")
ktest <- kruskal.test(
energy ~ continent,
data = gmcleane
)
The p value found above is 1.0124039^{-67}, which implies that there is a significant difference between energy use of each continent.
stest <- shapiro.test(gmclean$`Energy use (kg of oil equivalent per capita)`)
The Kruskal-Wallis test was used as it is a non-parametric test and the data used is not distributed normally. This test can calculate significance for data with more than two groups as well. The Shapiro-Wilk normality test was used to see if the data is significantly different from a normal distribution. As the p-value (5.058735^{-48}) from the code above is less than (α=0.05), we can assume non-normality.
Is there a significant difference between Europe and Asia with respect to ‘Imports of goods and services (% of GDP)’ in the years after 1990? (stats test needed)
gmclean1990 <- gmclean %>%
filter(Year > 1990)
europe90 <- gmclean1990 %>%
filter(continent == "Europe")
eurogoods <- europe90$`Imports of goods and services (% of GDP)`
asia90 <- gmclean1990 %>%
filter(continent == "Asia")
asiagoods <- asia90$`Imports of goods and services (% of GDP)`
wtest <- wilcox.test(eurogoods, asiagoods, alternative = "two.sided")
To test if there is a significant difference between two variables of interest, I ran a wilcox test. With the p-value(0.7867013) shown above, there isn’t a significant difference between the imports of goods and services between the two continents as the significance level (α) is set to 0.05.
ggplot(europe90, aes(x = europe90$`Imports of goods and services (% of GDP)`)) +
geom_histogram(binwidth = 5) +
labs(title = "Europe", x = "`Imports of goods and services (% of GDP)`", y = "count")
ggplot(asia90, aes(x = asia90$`Imports of goods and services (% of GDP)`)) +
geom_histogram(binwidth = 5) +
labs(title = "Asia", x = "`Imports of goods and services (% of GDP)`", y = "count")
The wilcox test was chosen as the data is not normally distributed. The non-normal distribution can be seen here for the ‘Imports of goods and services (% of GDP)’ across both Asia and Europe.
What is the country (or countries) that has the highest ‘Population density (people per sq. km of land area)’ across all years?
rankpd <- gmclean %>%
rename(country = `Country Name`) %>%
rename(popd = `Population density (people per sq. km of land area)`) %>%
select(country, Year, popd) %>%
na.omit() %>%
group_by(Year) %>%
mutate(r = rank(popd, ties.method = "average")) %>%
arrange(desc(r)) %>%
ungroup() %>%
filter(r == max(r))
kbl(head(rankpd)) %>%
kable_styling()
| country | Year | popd | r |
|---|---|---|---|
| Macao SAR, China | 2002 | 16451.04 | 262 |
| Monaco | 2007 | 17523.00 | 262 |
The countries with the highest population density across all years in the dataset are shown above. Macao SAR, China had the highest average ranking in population density, with 1.6451037^{4} people per sq. km of land area.
What country (or countries) has shown the greatest increase in ‘Life expectancy at birth, total (years)’ between 1962 and 2007?
The countries with the greatest increase in life expectancy at birth are shown below.
explife <- gmclean %>%
group_by(`Country Name`) %>%
filter(Year %in% c(1962,2007)) %>%
select(`Country Name`, Year, `Life expectancy at birth, total (years)`) %>%
spread(key = Year, value = `Life expectancy at birth, total (years)`) %>%
mutate(inc = `2007`-`1962`) %>%
select(-`1962`, -`2007`) %>%
arrange(desc(inc))
kbl(head(explife)) %>%
kable_styling()
| Country Name | inc |
|---|---|
| Maldives | 36.91615 |
| Bhutan | 33.19895 |
| Timor-Leste | 31.08515 |
| Tunisia | 30.86076 |
| Oman | 30.82310 |
| Nepal | 30.59963 |